17 research outputs found

    Explainability in Deep Learning by Means of Communication

    Get PDF
    Research in deep learning has seen extraordinary advances at a fast pace, leading to the emergence of many prospective applications, especially in the computer vision domain. Nonetheless, most of the machine learning pipeline remains opaque and hard to explain to humans; the resulting lack of understanding and trust limits the deployment of these systems in critical scenarios such as health care. In this thesis, we propose machine learning models that are inherently more explainable than the base models they are derived from. By exposing parts of the decision-making process and increasing overall transparency, we develop tools that allow humans to better evaluate the strengths and weaknesses of machine learning models and to assess their suitability for deployment. The research in this thesis takes inspiration from how explanations are formed between humans through communication. We present novel approaches that look at different aspects of the communication process and embed them into neural network models to make them more interpretable and/or better integrated into human-computer interaction. Communication --- and, to a greater extent, human language --- is a natural way for humans to compose explanations, so that explanation systems do not impose on users a learning curve for interacting with the artificial intelligence. Specifically, we propose a multi-agent communication setting in which messages between the agents resemble a yes/no question-answer discourse. Agents trained to solve an image classification task learn to build a decision tree that globally describes the prediction procedure. We find that human-understandable side-information is key to making the framework truly explainable. While this approach finds one globally viable explanation for a given problem, humans are diverse in their communication, languages, and perception. To this end, we build an agent that crafts purposeful messages, observing how effectively its communication partner acts upon the communicated information and adjusting its own policy accordingly. Then, we show that efficient communication can also be used as a metric that, when optimized for, leads to both highly compressed abstractions and interpretable insights on human-drawn sketches and sketch-based tasks. Finally, we take a broader look at how semantic information, e.g., from language, can enrich vision models and make them more explainable.
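
    To make the yes/no question-answer discourse concrete, here is a minimal, hypothetical sketch: a questioning agent walks a shared binary decision tree, an answering agent replies truthfully about its private image, and the sequence of answers determines the predicted class. The attribute questions, labels, and hand-built tree are toy stand-ins; in the thesis, the agents learn the messages and the resulting tree from data.

from dataclasses import dataclass

@dataclass
class Node:
    question: str            # attribute queried by the questioning agent
    yes: "Node | str"        # subtree or class label reached on a "yes" answer
    no: "Node | str"         # subtree or class label reached on a "no" answer

# A toy tree the agents could converge to for a 3-class bird/plane/car problem.
tree = Node("has wings?",
            yes=Node("has feathers?", yes="bird", no="plane"),
            no="car")

def answerer(image_attrs: set, question: str) -> bool:
    """The answering agent replies truthfully about its private image."""
    return question.rstrip("?") in image_attrs

def classify(image_attrs: set, node) -> str:
    """The questioning agent walks the shared tree using the yes/no answers."""
    while isinstance(node, Node):
        node = node.yes if answerer(image_attrs, node.question) else node.no
    return node

print(classify({"has wings", "has feathers"}, tree))  # -> bird
print(classify({"has wings"}, tree))                  # -> plane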

    Iterative Superquadric Recomposition of 3D Objects from Multiple Views

    Full text link
    Humans are good at recomposing novel objects, i.e., they can identify commonalities between unknown objects from general structure to finer detail, an ability difficult to replicate by machines. We propose a framework, ISCO, to recompose an object using 3D superquadrics as semantic parts directly from 2D views, without training a model that uses 3D supervision. To achieve this, we optimize the superquadric parameters that compose a specific instance of the object, comparing its rendered 3D view to the 2D image silhouette. Our ISCO framework iteratively adds new superquadrics wherever the reconstruction error is high, abstracting first coarse regions and then finer details of the target object. With this simple coarse-to-fine inductive bias, ISCO provides consistent superquadrics for related object parts, despite not having any semantic supervision. Since ISCO does not train any neural network, it is also inherently robust to out-of-distribution objects. Experiments show that, compared to recent single-instance superquadric reconstruction approaches, ISCO provides consistently more accurate 3D reconstructions, even from images in the wild. Code is available at https://github.com/ExplainableML/ISCO.
    Comment: Accepted at ICCV 2023
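
    As an illustration of the iterative, coarse-to-fine recomposition loop described above, here is a minimal sketch in 2D: superellipses (the planar analogue of superquadrics) are added one at a time where the silhouette reconstruction error is largest and refined by crude random search. The toy target, grid size, and search procedure are assumptions made for illustration only; the actual ISCO method optimizes 3D superquadric parameters against rendered silhouettes with gradient-based optimization.

import numpy as np

H = W = 64
yy, xx = np.mgrid[0:H, 0:W]

def superellipse_mask(cx, cy, ax, ay, eps):
    """Inside-test of a 2D superellipse: |x/a|^(2/eps) + |y/b|^(2/eps) <= 1."""
    u = np.abs((xx - cx) / ax) ** (2.0 / eps) + np.abs((yy - cy) / ay) ** (2.0 / eps)
    return u <= 1.0

# Toy target silhouette: a plus-shaped object made of two rectangles.
target = np.zeros((H, W), bool)
target[20:44, 28:36] = True
target[28:36, 12:52] = True

parts, recon = [], np.zeros((H, W), bool)
rng = np.random.default_rng(0)

for _ in range(3):                                   # iteratively add parts
    error = target & ~recon                          # where reconstruction is still missing
    if not error.any():
        break
    cy, cx = np.argwhere(error).mean(axis=0)         # seed the new part where error is high
    best, best_iou = None, -1.0
    for _ in range(2000):                            # crude random-search refinement
        p = (cx + rng.normal(0, 2), cy + rng.normal(0, 2),
             rng.uniform(2, 25), rng.uniform(2, 25), rng.uniform(0.3, 2.0))
        cand = recon | superellipse_mask(*p)
        iou = (cand & target).sum() / (cand | target).sum()
        if iou > best_iou:
            best, best_iou = p, iou
    parts.append(best)
    recon |= superellipse_mask(*best)
    print(f"parts={len(parts)}  IoU={best_iou:.2f}")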

    DeViL: Decoding Vision features into Language

    Full text link
    Post-hoc explanation methods have often been criticised for abstracting away the decision-making process of deep neural networks. In this work, we provide natural language descriptions of what different layers of a vision backbone have learned. Our DeViL method decodes vision features into language, not only highlighting the attribution locations but also generating textual descriptions of visual features at different layers of the network. We train a transformer network to translate individual image features of any vision layer into a prompt that a separate off-the-shelf language model decodes into natural language. By employing dropout both per-layer and per-spatial-location, our model can generalize training on image-text pairs to generate localized explanations. As it uses a pre-trained language model, our approach is fast to train, can be applied to any vision backbone, and produces textual descriptions at different layers of the vision network. Moreover, DeViL can create open-vocabulary attribution maps corresponding to words or phrases even outside the training scope of the vision model. We demonstrate that DeViL generates textual descriptions relevant to the image content on CC3M, surpassing previous lightweight captioning models, and attribution maps uncovering the learned concepts of the vision backbone. Finally, we show that DeViL also outperforms the current state-of-the-art on the neuron-wise descriptions of the MILANNOTATIONS dataset. Code is available at https://github.com/ExplainableML/DeViL.
    Comment: Accepted at GCPR 2023 (Oral)
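
    A minimal, hypothetical sketch of the core translation step: a small network maps a feature vector from some vision layer into a sequence of soft-prompt embeddings that a frozen language model could then decode into text. The dimensions, module names, and the simple per-layer dropout shown here are illustrative assumptions rather than the paper's implementation.

import torch
import torch.nn as nn

class FeatureToPrompt(nn.Module):
    def __init__(self, feat_dims, lm_dim=768, prompt_len=10, p_layer=0.3):
        super().__init__()
        # One linear projection per supported vision layer (layers have different widths).
        self.proj = nn.ModuleList([nn.Linear(d, lm_dim) for d in feat_dims])
        self.layer_emb = nn.Embedding(len(feat_dims), lm_dim)
        self.to_prompt = nn.Linear(lm_dim, prompt_len * lm_dim)
        self.prompt_len, self.lm_dim = prompt_len, lm_dim
        self.p_layer = p_layer  # per-layer dropout probability during training

    def forward(self, feats, layer_idx):
        # feats: (B, D_layer) -- one spatial location, already pooled or selected.
        x = self.proj[layer_idx](feats) + self.layer_emb.weight[layer_idx]
        if self.training and torch.rand(()) < self.p_layer:
            x = torch.zeros_like(x)  # drop this layer's evidence entirely
        return self.to_prompt(x).view(-1, self.prompt_len, self.lm_dim)

# Usage: project a 512-d feature from "layer 1" into 10 soft tokens of width 768,
# which would be prepended to the caption embeddings of a frozen language model.
translator = FeatureToPrompt(feat_dims=[256, 512, 1024])
soft_prompt = translator(torch.randn(4, 512), layer_idx=1)
print(soft_prompt.shape)  # torch.Size([4, 10, 768])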

    In-Context Impersonation Reveals Large Language Models' Strengths and Biases

    Full text link
    In everyday conversations, humans can take on different roles and adapt their vocabulary to their chosen roles. We explore whether LLMs can take on, that is, impersonate, different roles when they generate text in-context. We ask LLMs to assume different personas before solving vision and language tasks. We do this by prefixing the prompt with a persona that is associated either with a social identity or domain expertise. In a multi-armed bandit task, we find that LLMs pretending to be children of different ages recover human-like developmental stages of exploration. In a language-based reasoning task, we find that LLMs impersonating domain experts perform better than LLMs impersonating non-domain experts. Finally, we test whether LLMs' impersonations are complementary to visual information when describing different categories. We find that impersonation can improve performance: an LLM prompted to be a bird expert describes birds better than one prompted to be a car expert. However, impersonation can also uncover LLMs' biases: an LLM prompted to be a man describes cars better than one prompted to be a woman. These findings demonstrate that LLMs are capable of taking on diverse roles and that this in-context impersonation can be used to uncover their hidden strengths and biases.
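
    The prompting mechanism itself is simple to sketch: the same task instruction is prefixed with different personas before being sent to an LLM. The template, personas, task, and the query_llm call below are hypothetical placeholders; the paper's exact prompt wording and evaluation protocol may differ.

def persona_prompt(persona: str, task: str) -> str:
    """Prefix the task with a persona the model should impersonate."""
    return f"If you were {persona}, {task}"

task = "describe what distinguishes a Blue Jay from other birds in one sentence."
personas = ["a 4 year old", "an ornithologist (bird expert)", "a car mechanic"]

for persona in personas:
    prompt = persona_prompt(persona, task)
    print(prompt)
    # response = query_llm(prompt)  # hypothetical call to whichever LLM API is used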

    PDiscoNet: Semantically consistent part discovery for fine-grained recognition

    Full text link
    Fine-grained classification often requires recognizing specific object parts, such as beak shape and wing patterns for birds. Encouraging a fine-grained classification model to first detect such parts and then use them to infer the class could help us gauge, better than interpretability methods that provide a single attribution map, whether the model is indeed looking at the right details. We propose PDiscoNet to discover object parts by using only image-level class labels along with priors encouraging the parts to be: discriminative, compact, distinct from each other, equivariant to rigid transforms, and active in at least some of the images. In addition to using the appropriate losses to encode these priors, we propose to use part-dropout, where full part feature vectors are dropped at once to prevent a single part from dominating the classification, and part feature vector modulation, which makes the information coming from each part distinct from the perspective of the classifier. Our results on CUB, CelebA, and PartImageNet show that the proposed method provides substantially better part discovery performance than previous methods, while not requiring any additional hyper-parameter tuning and without penalizing the classification performance. The code is available at https://github.com/robertdvdk/part_detection.
    Comment: 9 pages, 8 figures, ICCV 2023
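
    A minimal, hypothetical sketch of the part-dropout idea mentioned above: entire per-part feature vectors are zeroed at once during training so that no single discovered part can dominate the classification. The shapes and dropout rate are illustrative assumptions, not the paper's settings.

import torch

def part_dropout(part_feats: torch.Tensor, p: float = 0.3, training: bool = True):
    """part_feats: (batch, num_parts, dim). Drops whole parts at once, then rescales."""
    if not training or p == 0.0:
        return part_feats
    keep = (torch.rand(part_feats.shape[:2], device=part_feats.device) > p).float()
    return part_feats * keep.unsqueeze(-1) / (1.0 - p)  # inverted-dropout scaling

feats = torch.randn(2, 8, 256)        # 2 images, 8 discovered parts, 256-d features each
print(part_dropout(feats).shape)      # torch.Size([2, 8, 256])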

    Compositional Mixture Representations for Vision and Text

    Full text link
    Learning a common representation space between vision and language allows deep networks to relate objects in the image to their corresponding semantic meaning. We present a model that learns a shared Gaussian mixture representation, imposing the compositionality of the text onto the visual domain without explicit location supervision. By combining a spatial transformer with a representation learning approach, we learn to split images into separately encoded patches and to associate visual and textual representations in an interpretable manner. On variations of MNIST and CIFAR10, our model is able to perform weakly supervised object detection and demonstrates its ability to extrapolate to unseen combinations of objects.
    Comment: Workshop on Learning with Limited Labelled Data for Image and Video Understanding (L3D-IVU), CVPR 2022
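
    A minimal, hypothetical sketch of the shared-mixture idea: image patch embeddings and text-derived concept means are scored under the same Gaussian mixture, so each patch is softly assigned to a textual concept. The embedding sizes, the concept means, and the isotropic Gaussian assumption are illustrative; the actual model is trained end to end, with a spatial transformer producing the patches.

import torch
import torch.nn.functional as F

def mixture_responsibilities(x, means, log_sigma):
    """x: (N, D) patch embeddings; means: (K, D) component means, one per text concept."""
    # Isotropic Gaussian log-likelihood of every embedding under every component
    # (the constant term cancels in the softmax).
    d2 = ((x[:, None, :] - means[None, :, :]) ** 2).sum(-1)        # (N, K) squared distances
    log_p = -0.5 * d2 / torch.exp(2 * log_sigma) - x.shape[1] * log_sigma
    return F.softmax(log_p, dim=-1)                                # soft patch-to-concept assignments

patches = torch.randn(4, 64)    # 4 patch embeddings from a spatial transformer (assumed)
concepts = torch.randn(3, 64)   # 3 concept means tied to text tokens (assumed)
resp = mixture_responsibilities(patches, concepts, log_sigma=torch.tensor(0.0))
print(resp.shape, resp.sum(-1))  # (4, 3); each row sums to 1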